library(haven)
library(tidyverse)
library(readxl)
library(tigris)
library(tidycensus)
library(sf)
library(leaflet)
library(gridExtra)
library(RColorBrewer)
library(kableExtra)
options(tigris_use_cache = TRUE)
options(progress_enabled = FALSE)Another comparison of two area-level socioeconomic deprivation indices: SVI (Social Vulnerability Index) and ADI (Area Deprivation Index) and their associations with asthma and smoking outcomes from survey data
BMIN503/EPID600 Final Project
This report is still under construction, but ready for peer feedback.
1 Overview
The goal of this project is to compare the Agency for Toxic Substances and Disease Registry Social Vulnerability Index (SVI) and the University of Wisconsin-developed Area Deprivation Index (ADI) in their association with asthma and smoking outcomes from the Behavioral Risk Factor Surveillance System (BRFSS) CDC dataset.
For SVI, the most recent available dataset is from 2020. For this project, national ranking at the county level will be used. The most recent ADI dataset is from 2021. National ranking for ADI is provided at the census block group level. For the asthma and smoking outcomes in the BRFSS CDC dataset, the 2021 Selected Metropolitan/Micropolitan Area Risk Trends (SMART) subset will used. The SMART BRFSS dataset provides the CBSA (Core-based statistical area), which “consist of the county or counties (or equivalent entities) associated with at least one core (urban area) of at least 10,000 population, plus adjacent counties having a high degree of social and economic integration with the core as measured through commuting ties.” https://www.census.gov/programs-surveys/metro-micro/about/glossary.html
2 Introduction
As Rollings et al 2023 discuss, there exist several area-level socioeconomic deprivation indices without consensus within healthcare research and policy fields about which indices should be used for a variety of analyses. In comparing two frequently used indices, SVI and ADI, in their associations with SMART BRFSS dataset, this project has the potential to inform selection of the appropriate index for future work. This project is interdisciplinary in that it spans data science, social science, and biomedical informatics.
A challenge for this project, as with many projects which incorporate geospatial variables, is differing geographic units. For this project, SVI is at the county level, ADI is at the census block group level, and the BRFSS outcomes are at the CBSA level. Fortunately, all of these geographic units are census-defined and are supersets or subsets of each other. The current plan is to link these datasets at the county level. Based on guidance from Dr. Sherrie Xie, ADI will be aggregated to the county-level using a weighted average that incorporates census counts for census block groups. The BRFSS smoking and asthma outcomes will be aggregated at the CBSA level. CBSAs and CBSA level smoking and asthma outcomes will be mapped to counties. An alternative approach uses weighted averages to aggregate both SVI and ADI to the CBSA level, which can also be explored.
Once the linked dataset is prepared, logistic regression models will be used to compare SVI and ADI as predictors of the aggregated smoking and asthma outcomes. Spatial autoregressive regression, as used by Tipirneni et al 2022 will also be considered. For smoking, the smoking status variable _SMOKER3 will be aggregated at the CBSA level. For asthma, the variable for lifetime asthma prevalence _LTASTH1 will be aggregated at the CBSA level. More information about the smoking and asthma variables available in the SMART BRFSS dataset is provided in Section 7.
There are several limitations of the project. One clear limitation is that the analyses will be restricted to areas of the US which are included in a CBSA grouping included in the SMART BRFSS dataset.
3 Methods
3.1 SMART BRFSS CDC dataset
# MMSA2021.xpt retrieved from https://www.cdc.gov/brfss/smart/smart_2021.html
# Accompanying documentation available at https://www.cdc.gov/brfss/annual_data/2021/pdf/2021_smart_brfss_mmsa_methodology-508.pdf
MMSA2021 <- read_xpt("data/input_data/mmsa/MMSA2021_XPT/MMSA2021.xpt")MMSA2021 |> glimpse()# construct table of variable names and labels
mmsa_colname_labels <- MMSA2021 %>%
map_dfc(attr, "label") |>
pivot_longer(everything(),
names_to = "colname",
values_to = "label")
# include variable names without labels
mmsa_colnames_all <- MMSA2021 |>
colnames() |>
as_tibble() |>
rename(colname = value) |>
left_join(mmsa_colname_labels, by = join_by(colname))
# add missing labels documented in https://www.cdc.gov/brfss/annual_data/2021/pdf/2021_smart_brfss_mmsa_methodology-508.pdf
attr(MMSA2021[["_MMSA"]], "label") <- toupper("the code of metropolitan or micropolitan statistical area where the respondent lives")
attr(MMSA2021[["MMSANAME"]], "label") <- toupper("the MMSA name")
attr(MMSA2021[["_MMSAWT"]], "label") <- toupper(" the MMSA-level weight that is used when generating MMSA-level estimates for variables in the data set")
mmsa_colnames_all <- MMSA2021 %>%
map_dfc(attr, "label") |>
pivot_longer(everything(),
names_to = "colname",
values_to = "label")# rename columns to make them easier to work with using R (not requiring ``)
# underscore prefix indicates derived or computed, replace with d_
smart_2021 <- MMSA2021 |>
rename_with(~ tolower(gsub("_", "d_", .x, fixed = TRUE)))smart_2021 |> glimpse()# https://www.cdc.gov/brfss/annual_data/2021/pdf/2021_smart_brfss_mmsa_methodology-508.pdf
# Typically, BRFSS data are used to produce state-level estimates; however, for the SMART
# project, BRFSS data were used to produce small area-level estimates for MMSAs as defined by
# the US Census Bureau. On June 6, 2003, OMB issued new definitions for MMSA and
# metropolitan divisions. OMB periodically updates the list of MMSAs. The list of areas used for this
# analysis can be found here: https://www.census.gov/geographies/reference-files/timeseries/demo/metro-micro/delineation-files.html. The March 2020 release was used for defining
# the MMSAs in the 2021 SMART data set.
# retrieved from https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/historical-delineation-files.html
delin_2020 <- read_xls("data/input_data/mmsa/list1_2020.xls",
range = "A3:L1919") |>
rename_with(~ tolower(gsub(" ", "_", .x, fixed = TRUE))) |>
rename_with(~ tolower(gsub("/", "_", .x, fixed = TRUE)))
delin_2020_cbsa_join <- delin_2020 |> filter(is.na(metropolitan_division_code)) # TODO explore why this is necessary
delin_2020_mdc_join <- delin_2020 |> filter(!is.na(metropolitan_division_code)) # TODO explore why this is necessary
# TODO check this
mmsa_county_mapping_cbsa <- smart_2021 |>
mutate(d_mmsa_char = as.character(d_mmsa)) |>
distinct(d_mmsa, d_mmsa_char, d_mmsa_char, mmsaname) |>
inner_join(delin_2020_cbsa_join, by = c("d_mmsa_char" = "cbsa_code")) |>
distinct(d_mmsa, d_mmsa_char, csa_title, county_county_equivalent, fips_state_code, fips_county_code)
mmsa_county_mapping_mdc <- smart_2021 |>
mutate(d_mmsa_char = as.character(d_mmsa)) |>
distinct(d_mmsa, d_mmsa_char, d_mmsa_char, mmsaname) |>
inner_join(delin_2020_mdc_join, by = c("d_mmsa_char" = "metropolitan_division_code")) |>
distinct(d_mmsa, d_mmsa_char, csa_title, county_county_equivalent, fips_state_code, fips_county_code)
mmsa_county_mapping <- mmsa_county_mapping_cbsa |>
dplyr::union(mmsa_county_mapping_mdc) |>
distinct()smart_2021 |> summarize(n_distinct(d_mmsa)) |> pull() == mmsa_county_mapping |> summarize(n_distinct(d_mmsa)) |> pull()
# check for issues with leading zeros
smart_2021 |>
filter(nchar(d_mmsa) != 5)
smart_2021 |>
mutate(d_mmsa_char = as.character(d_mmsa)) |>
filter(nchar(d_mmsa_char) != 5)
mmsa_county_mapping |>
filter(nchar(d_mmsa_char) != 5)
delin_2020 |>
filter(nchar(cbsa_code) != 5)tigris_counties <- counties()
mmsa_county_mapping_geoid <- mmsa_county_mapping |>
mutate(smart = TRUE) |>
inner_join(tigris_counties, by = c("fips_state_code"= "STATEFP", "fips_county_code" = "COUNTYFP")) |>
mutate(smart = replace_na(smart, FALSE))
mmsa_county_mapping_geoid |> filter(smart == TRUE) |> nrow() == mmsa_county_mapping |> nrow()[1] TRUE
mmsa_county_mapping_geoid_sf <- mmsa_county_mapping_geoid |>
filter(!fips_state_code %in% c("02",
"15", "60",
"66", "69",
"72", "78")) |> # limit to contiguous states and DC
st_as_sf()
# simple map
# mmsa_county_mapping_geoid_sf |>
# st_simplify(dTolerance = 1e3) |>
# ggplot(aes(fill = smart)) +
# geom_sf()# Before we create the leaflet maps, we will change the CRS (Coordinate Reference System) to 4326 because leaflet expects the data provided to it to be in WGS84, which is EPSG:4326. We can do the transformation with st_transform.
# This is slow so I did it once and saved the RDS
mmsa_county_mapping_geoid_sf_4326 <- st_transform(mmsa_county_mapping_geoid_sf, crs = 4326)
mmsa_county_mapping_geoid_sf_4326 |> saveRDS("data/intermediate_data/mmsa_county_mapping_geoid_sf_4326.rds")mmsa_county_mapping_geoid_sf_4326 <- readRDS("data/intermediate_data/mmsa_county_mapping_geoid_sf_4326.rds")pal_fun <- colorFactor(topo.colors(2), mmsa_county_mapping_geoid_sf_4326$smart)
mmsa_county_mapping_geoid_sf_4326 |>
st_simplify(dTolerance = 1e3) |>
leaflet() |>
addPolygons(
stroke = FALSE,
# remove polygon borders
fillColor = ~ pal_fun(smart),
fillOpacity = 0.5,
smoothFactor = 0.5,
# increase opacity and resolution
popup = mmsa_county_mapping_geoid_sf_4326$county_county_equivalent
) |>
addProviderTiles(providers$CartoDB.Voyager) |> # add third party provider tile
addLegend(
"bottomright",
# location of legend
pal = pal_fun,
# palette function
values = ~ smart,
# variable to pass to palette function
title = 'Included in SMART BRFSS dataset',
# legend title
opacity = 1 # legend opacity (1 = completely opaque)
) |>
addScaleBar()3.1.1 Aggregating smoking and lung cancer variables at CBSA level
3.1.1.1 _SMOKER3
To be added
3.1.1.2 _LTASTH1
To be added
3.2 SVI dataset
# retrieved from https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html
svi2020 <- read_csv("data/input_data/svi/SVI_2020_US_county.csv")svi2020 |> glimpse()svi2020_sf <- svi2020 |>
select(RPL_THEME1,
RPL_THEME2,
RPL_THEME3,
RPL_THEME4,
RPL_THEMES,
FIPS) |>
left_join(tigris_counties, by = c("FIPS" = "GEOID"))
st_as_sf()
svi2020_sf_4326 <- st_transform(svi2020_sf, crs = 4326)
svi2020_sf_4326 |> saveRDS("data/intermediate_data/svi2020_sf_4326.rds")svi2020_sf_4326 <-
readRDS("data/intermediate_data/svi2020_sf_4326.rds") |>
filter(!STATEFP %in% c("02",
"15", "60",
"66", "69",
"72", "78")) # limit to contiguous states and DCpal_svi <- colorNumeric("inferno", NULL, reverse = TRUE)
svi2020_sf_4326 |>
filter(!is.na(FUNCSTAT)) |> # TODO explore why this is necessary
st_simplify(dTolerance = 1e3) |>
leaflet() |>
addPolygons(
stroke = FALSE,
# remove polygon borders
fillColor = ~ pal_svi(RPL_THEME1),
fillOpacity = 0.5,
smoothFactor = 0.5
# increase opacity and resolution
) |>
addProviderTiles(providers$CartoDB.Voyager) |> # add third party provider tile
addLegend(
"bottomright",
# location of legend
pal = pal_svi,
# palette function
values = ~ RPL_THEMES,
# variable to pass to palette function
title = 'SVI',
# legend title
opacity = 1 # legend opacity (1 = completely opaque)
) |>
addScaleBar()3.3 ADI dataset
# retrieved from https://www.neighborhoodatlas.medicine.wisc.edu/
adi2021 <- read_csv("data/input_data/adi/adi-download/US_2021_ADI_Census_Block_Group_v4_0_1.csv")
tigris_cbg <- block_groups(cb = TRUE) |>
filter(!STATEFP %in% c("02",
"15", "60",
"66", "69",
"72", "78"))adi2021 |> glimpse()adi2021_sf <- adi2021 |>
inner_join(tigris_cbg, by = c("FIPS" = "GEOID")) |>
st_as_sf()
adi2021_sf_4326 <- st_transform(adi2021_sf, crs = 4326)
adi2021_sf_4326 |> saveRDS("data/intermediate_data/adi2021_sf_4326.rds")adi_no_cbg <- adi2021 |>
anti_join(tigris_cbg, by = c("FIPS" = "GEOID"))
cbg_no_adi <- tigris_cbg |>
anti_join(adi2021, by = c("GEOID" = "FIPS"))
# When a Census block group falls into one or more of the suppression criteria mentioned above the ADI rank is replaced with a code describing the suppression reason. Three possible codes will appear in the ADI field: "PH" for suppression due to low population and/or housing, "GQ" for suppression due to a high group quarters population, and "PH-GQ" for suppression due to both types of suppression criteria. A code of "QDI" designates block groups without an ADI due to Questionable Data Integrity, stemming from missing data in the source ACS data.adi2021_sf_4326 <- readRDS("data/intermediate_data/adi2021_sf_4326.rds")contig_state_vctr <- datasets::state.abb |>
setdiff(c("AK", "HI"))
# https://walker-data.com/umich-workshop-2022/intro-2020-census/#1
# get cbg counts to compute cbg weights for aggregating adi at county level
cbg_counts <- list()
for (contig_state in contig_state_vctr) {
cbg_counts[[contig_state]] <- get_decennial(
geography = "block group",
variables = "P1_001N",
state = contig_state,
year = 2020,
output = "wide"
) |>
mutate(state_abb = contig_state)
}
cbg_counts |> saveRDS("data/intermediate_data/cbg_counts.RDS")cbg_counts <- readRDS("data/intermediate_data/cbg_counts.RDS") |>
bind_rows()
# computed cbg weights for county level
cbg_wts <- tigris_cbg |>
select(STATEFP, COUNTYFP, TRACTCE, BLKGRPCE, GEOID) |>
st_drop_geometry() |>
inner_join(cbg_counts, by = "GEOID") |>
group_by(STATEFP, COUNTYFP) |>
# county pop
mutate(county_pop = sum(P1_001N)) |>
ungroup() |>
# cbg pop / county pop
mutate(cbg_wt = P1_001N/county_pop) |>
arrange(GEOID)
cbg_wts |> saveRDS("data/intermediate_data/cbg_wts.RDS")cbg_wts <- readRDS("data/intermediate_data/cbg_wts.RDS")
# compute weighted average for adi at county level
county_adi <- adi2021_sf_4326 |>
select(-STATEFP, -COUNTYFP, -TRACTCE, -BLKGRPCE, -NAME) |>
inner_join(cbg_wts, by = c("FIPS" = "GEOID")) |>
mutate(ADI_NATRANK = as.numeric(ADI_NATRANK)) |>
filter(!is.na(ADI_NATRANK)) |>
# national adi rank * cbg wt
mutate(adi_wts = ADI_NATRANK * cbg_wt) |>
group_by(STATEFP, COUNTYFP) |>
# weighted average sum of weighted adis
summarize(county_adi_value = sum(adi_wts))
county_adi |> saveRDS("data/intermediate_data/county_adi.RDS")
# cbg_counts_comb <- cbg_counts |>
# bind_rows()
# cbg_counts_comb |> ggplot(aes(x = P1_001N)) + geom_histogram()county_adi <- readRDS("data/intermediate_data/county_adi.RDS")pal_adi <- colorNumeric("plasma", NULL)
county_adi |>
st_simplify(dTolerance = 1e3) |>
leaflet() |>
addPolygons(
stroke = FALSE,
# remove polygon borders
fillColor = ~ pal_adi(county_adi_value),
fillOpacity = 0.5,
smoothFactor = 0.5
# increase opacity and resolution
) |>
addProviderTiles(providers$CartoDB.Voyager) |> # add third party provider tile
addLegend(
"bottomright",
# location of legend
pal = pal_adi,
# palette function
values = ~ county_adi_value,
# variable to pass to palette function
title = 'adi',
# legend title
opacity = 1 # legend opacity (1 = completely opaque)
) |>
addScaleBar()3.4 Combine datasets
To be added
3.5 Modelling
To be added
4 Results
To be added
5 Conclusion
6 References
Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System Survey Questionnaire. Atlanta, Georgia: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2021.
Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance System Survey Data. Atlanta, Georgia: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2021.
Centers for Disease Control and Prevention/ Agency for Toxic Substances and Disease Registry/ Geospatial Research, Analysis, and Services Program. CDC/ATSDR Social Vulnerability Index 2020 Database US. https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html. Accessed on 2023/11/04.
Kind AJH, Buckingham W. Making Neighborhood Disadvantage Metrics Accessible: The Neighborhood Atlas. New England Journal of Medicine, 2018. 378: 2456-2458. DOI: 10.1056/NEJMp1802313. PMCID: PMC6051533.
Rollings KA, Noppert GA, Griggs JJ, Melendez RA, Clarke PJ. Comparison of two area-level socioeconomic deprivation indices: Implications for public health research, practice, and policy. PLoS One. 2023 Oct 5;18(10):e0292281. doi: 10.1371/journal.pone.0292281. PMID: 37797080; PMCID: PMC10553799.
Tipirneni R, Schmidt H, Lantz PM, Karmakar M. Associations of 4 Geographic Social Vulnerability Indices With US COVID-19 Incidence and Mortality. Am J Public Health. 2022 Nov;112(11):1584-1588. doi: 10.2105/AJPH.2022.307018. Epub 2022 Sep 15. PMID: 36108250; PMCID: PMC9558191.
University of Wisconsin School of Medicine and Public Health. 2021 Area Deprivation Index v4.0.1. Downloaded from https://www.neighborhoodatlas.medicine.wisc.edu/ 2023/11/04
Xie S, Himes BE. Approaches to Link Geospatially Varying Social, Economic, and Environmental Factors with Electronic Health Record Data to Better Understand Asthma Exacerbations. AMIA Annu Symp Proc. 2018 Dec 5;2018:1561-1570. PMID: 30815202; PMCID: PMC6371292.
Xie S, Hubbard RA, Himes BE. Neighborhood-level measures of socioeconomic status are more correlated with individual-level measures in urban areas compared with less urban areas. Ann Epidemiol. 2020 Mar;43:37-43.e4. doi: 10.1016/j.annepidem.2020.01.012. Epub 2020 Feb 11. PMID: 32151518; PMCID: PMC7160852.
7 Appendices
7.1 Session Info
sessionInfo()R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] kableExtra_1.3.4 RColorBrewer_1.1-3 gridExtra_2.3 leaflet_2.2.0
[5] sf_1.0-14 tidycensus_1.5 tigris_2.0.4 readxl_1.4.3
[9] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2
[13] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[17] ggplot2_3.4.3 tidyverse_2.0.0 haven_2.5.3
loaded via a namespace (and not attached):
[1] gtable_0.3.4 xfun_0.40 htmlwidgets_1.6.2
[4] tzdb_0.4.0 leaflet.providers_2.0.0 vctrs_0.6.3
[7] tools_4.3.1 crosstalk_1.2.0 generics_0.1.3
[10] parallel_4.3.1 proxy_0.4-27 fansi_1.0.4
[13] pkgconfig_2.0.3 KernSmooth_2.23-21 rematch_1.0.1
[16] webshot_0.5.5 uuid_1.1-1 lifecycle_1.0.3
[19] farver_2.1.1 compiler_4.3.1 munsell_0.5.0
[22] htmltools_0.5.6 class_7.3-22 yaml_2.3.7
[25] jquerylib_0.1.4 pillar_1.9.0 crayon_1.5.2
[28] ellipsis_0.3.2 classInt_0.4-10 viridis_0.6.4
[31] wk_0.8.0 tidyselect_1.2.0 rvest_1.0.3
[34] digest_0.6.33 stringi_1.7.12 fastmap_1.1.1
[37] grid_4.3.1 colorspace_2.1-0 cli_3.6.1
[40] magrittr_2.0.3 utf8_1.2.3 e1071_1.7-13
[43] withr_2.5.1 scales_1.2.1 rappdirs_0.3.3
[46] bit64_4.0.5 timechange_0.2.0 rmarkdown_2.24
[49] httr_1.4.7 bit_4.0.5 cellranger_1.1.0
[52] hms_1.1.3 evaluate_0.21 knitr_1.43
[55] viridisLite_0.4.2 s2_1.1.4 rlang_1.1.1
[58] Rcpp_1.0.11 glue_1.6.2 DBI_1.1.3
[61] xml2_1.3.5 vroom_1.6.3 svglite_2.1.1
[64] rstudioapi_0.15.0 jsonlite_1.8.7 R6_2.5.1
[67] systemfonts_1.0.4 units_0.8-4
7.2 Asthma and smoking BRFSS variables
The table below summarizes the variables of interest, combining information from https://www.cdc.gov/brfss/annual_data/2021/summary_matrix_21.html and https://www.cdc.gov/brfss/annual_data/2021/pdf/2021-calculated-variables-version4-508.pdf
| Output Variables (In Final Data Set) |
Description or Result of Calculation |
Values | |||||||||||||||
| _LTASTH1 | CV for lifetime asthma prevalence / Calculated variable for adults who have ever been told they have asthma. _LTASTH1 is derived from ASTHMA3. |
|
|||||||||||||||
| _CASTHM1 | CV for current asthma prevalence / Calculated variable for adults who have been told they currently have asthma. _CASTHM1 is derived from ASTHMA3 and ASTHNOW. |
|
|||||||||||||||
| _ASTHMS1 | Computed asthma status / Calculated variable for computed asthma status. _ASTHMS1 is derived from ASTHMA3 and ASTHNOW |
|
|||||||||||||||
| _SMOKER3 | Smoking Status: Everyday smoker, someday smoker, former smoker and non smoker / Calculated variable for four-level smoker status: everyday smoker, someday smoker, former smoker, non-smoker. _SMOKER3 is derived from SMOKE100 and SMOKDAY2. |
|
|||||||||||||||
| _RFSMOK3 | CV for current smoking status / _RFSMOK3 Calculated variable for adults who are current smokers. _RFSMOK3 is derived from _SMOKER3. |
|
|||||||||||||||
| _CURECI1 | CV for current e-cigarette smoker status / Calculated variable for adults who are current e-cigarette users. _CURECI1 is derived from ECIGNOW1. |
|